Rejoinder: Discussion of Bump Hunting in High Dimensional Data
Abstract
We thank all of the discussants for taking the time to contribute, and for both their compliments and helpful criticism. Before responding in detail, we address two general concerns that ran throughout many of the discussions.

1 Computation

Several discussants raised the issue of the computational requirements of the peeling algorithm (Section 7). They were concerned about its ability to scale to large (in memory) data sets. Here we provide some insight in the form of a worst case analysis.

Let $\alpha$ be the minimum fraction of observations permitted to be trimmed from each box in the peeling sequence. The current version of the software (Friedman 1998) has been modified to incorporate such a limit for both real-valued and categorical variables. For the latter, this implies that more than one value may need to be trimmed from the current box at some iterations. This requires no additional computation since the values can be considered in the order of their individual output means within each current box.

Initially, before any peeling trajectories are constructed, the values of each real-valued variable are sorted in ascending order. The corresponding order information is stored as a linked list so that individual deletions can be performed later in constant time. This requires an initial computation proportional to $n_R N \log N$, where $n_R$ is the number of real-valued variables and $N$ is the training sample size.

An upper bound $L$ on the number of boxes in each trajectory is given by

$$L = \frac{\log \beta_0}{\log(1 - \alpha)},$$

where $\beta_0$ is the specified minimum box size (7.4). Here it is assumed that the minimum fraction of observations is peeled at each iteration of the trajectory. In this case, the number of observations in the current box at the $l$th step is

$$N_l = N(1 - \alpha)^l.$$

For each categorical input variable the computation required to compute the peeling criterion (14.2) is proportional to $N_l$.
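As a rough numerical illustration of the worst-case quantities described above, the sketch below evaluates the trajectory-length bound (log of the minimum box size divided by log of one minus the peeling fraction) and the box size after a given number of peels. The function names and the example values (peeling fraction 0.05, minimum box size 0.01, 10,000 observations) are assumptions chosen for illustration and do not come from the rejoinder or its software.

```python
import math


def trajectory_length_bound(alpha: float, beta0: float) -> float:
    """Upper bound L = log(beta0) / log(1 - alpha) on boxes per trajectory.

    alpha: fraction of observations peeled at each step (assumed constant).
    beta0: minimum box size, as a fraction of the training sample.
    """
    if not (0.0 < alpha < 1.0 and 0.0 < beta0 < 1.0):
        raise ValueError("alpha and beta0 must lie strictly between 0 and 1")
    return math.log(beta0) / math.log(1.0 - alpha)


def box_size(n: int, alpha: float, step: float) -> float:
    """Observations remaining after `step` peels: N_l = N * (1 - alpha)**l."""
    return n * (1.0 - alpha) ** step


# Illustrative values only (assumptions, not taken from the paper).
alpha, beta0, n = 0.05, 0.01, 10_000
bound = trajectory_length_bound(alpha, beta0)
print(f"worst-case number of boxes: {bound:.1f}")
for step in (0, 10, 50):
    print(step, round(box_size(n, alpha, step)))
```

By construction, evaluating the box size at the bound itself returns the minimum box size times the sample size, which is a convenient sanity check on the two formulas.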
For real-valued variables this computation is proportional to $\alpha N_l$, since the linked lists can be used, along with updating formulae for the means, so that only the $\alpha$-fraction of the observations at the extremes need be considered. The total computation bound for producing a peeling trajectory in this case is proportional to
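The sorted-order linked list mentioned above can be sketched as follows. This is a minimal illustration, assuming an array-backed doubly linked list over observation indices; the class name `SortedOrderList` and its methods are hypothetical and are not taken from the Friedman (1998) software. It shows the one-time sort, constant-time deletion, and reading off only the observations at either extreme.

```python
from typing import List, Sequence


class SortedOrderList:
    """Doubly linked list of observation indices, ordered by one variable."""

    def __init__(self, values: Sequence[float]) -> None:
        n = len(values)
        # One-time O(N log N) sort of the indices by value.
        order = sorted(range(n), key=values.__getitem__)
        self.prev = [-1] * n
        self.next = [-1] * n
        self.alive = [True] * n
        for a, b in zip(order, order[1:]):
            self.next[a] = b
            self.prev[b] = a
        self.head = order[0] if order else -1
        self.tail = order[-1] if order else -1

    def delete(self, i: int) -> None:
        """Unlink observation i in constant time (no-op if already removed)."""
        if not self.alive[i]:
            return
        self.alive[i] = False
        p, nx = self.prev[i], self.next[i]
        if p == -1:
            self.head = nx
        else:
            self.next[p] = nx
        if nx == -1:
            self.tail = p
        else:
            self.prev[nx] = p

    def lowest(self, k: int) -> List[int]:
        """Indices of the k remaining observations with the smallest values."""
        out, i = [], self.head
        while i != -1 and len(out) < k:
            out.append(i)
            i = self.next[i]
        return out

    def highest(self, k: int) -> List[int]:
        """Indices of the k remaining observations with the largest values."""
        out, i = [], self.tail
        while i != -1 and len(out) < k:
            out.append(i)
            i = self.prev[i]
        return out
```

In this sketch, a candidate peel on one variable only walks `k` nodes from the head or tail, so the work per variable scales with the number of observations at the extremes rather than with the full box.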
Related resources
The Bump Hunting Using the Decision Tree Combined with the Genetic Algorithm: Extreme-value Statistics Aspect
In difficult classification problems of the z-dimensional points into two groups giving 0-1 responses due to the messy data structure, it is more favorable to search for the denser regions for the response 1 points than to find the boundaries to separate the two groups; this is called the bump hunting. In a series of previous studies, we have shown that a bump hunting method using the decision ...
Cross-validation and peeling strategies for survival bump hunting using recursive peeling methods
We introduce a framework to build a survival/risk bump hunting model with a censored time-to-event response. Our Survival Bump Hunting (SBH) method is based on a recursive peeling procedure that uses a specific survival peeling criterion derived from non/semi-parametric statistics such as the hazards-ratio, the log-rank test or the Nelson-Aalen estimator. To optimize the tuning parameter of th...
Developments and Challenges in Mixture Models, Bump Hunting and Measurement Error Models
Bumps, components, clusters and atypical structures from real data often lead to scientific discoveries or reveal interesting phenomena of a population. They are important in astronomy, biology, data mining, bioinformatics and in applications to virtually all natural and social sciences. The wide interest in such structures has in the last decade led to significant developments in each of these...
PRIM analysis
This paper analyzes a data mining/bump hunting technique known as PRIM (Fisher and Friedman, 1999). PRIM finds regions in high-dimensional input space with large values of a real output variable. This paper provides the first thorough study of statistical properties of PRIM. Amongst others, we characterize the output regions PRIM produces, and derive rates of convergence for these regions. Sinc...
Rejoinder to Post-selection shrinkage estimation for high-dimensional data analysis
We sincerely thank all the discussants Kjell Doksum and Joan Fujimura (DF); Jianqing Fan (Fan); Peihua Qiu, Kai Yang, and Lu You (QYY); and Yanming Li, Hyokyoung Grace Hong, and Yi Li (LHL) for the thought-provoking and insightful discussions on our paper. We would also like to thank the Editor Fabrizio Ruggeri for processing and organizing the discussion. Ahmed would like to specially thank hi...
Publication date: 1999